library(knitr)
library(rmarkdown)
library(ggplot2)
library(plotly)
library(data.table)
library(treemap)
library(dplyr)
library(corrplot)
library(qqplotr)
library(car)
The World Happiness Report is a landmark survey of the state of global happiness. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – utilize these measurements of well-being to effectively assess the progress of nations.
The happiness scores and rankings use data from the Gallup World Poll. The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0 and to rate their own current lives on that scale. The scores are from nationally representative samples for the year 2019 and use the Gallup weights to make the estimates representative.
The happiness reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness. The factor values do not change the score reported for each country, but they help explain why some countries rank higher than others. The happiness rankings are determined from nationally representative samples with a typical sample size of 1,000 people per nation. All numeric data is scaled between 1 and 10. In this project, we explore the relationship between these happiness metrics and the happiness score.
Data is collected by SDSN and is available online here
For this project, the data was downloaded as a CSV file.
happiness = read.csv("/Users/vigneshsankardas/Desktop/Class/Projects/555_HappinessAnalysis/Raw_Data/Happiness_2019.csv")
WorldHappinessReport2019<-data.frame(happiness)
colnames(WorldHappinessReport2019)
## [1] "Overall.rank" "Country.or.region"
## [3] "Score" "GDP.per.capita"
## [5] "Social.support" "Healthy.life.expectancy"
## [7] "Freedom.to.make.life.choices" "Generosity"
## [9] "Perceptions.of.corruption"
Each case represents a country from around the world. There are 156 observations in the 2019 data set.
The data set contains 9 variables:
All numeric parameters above are scaled between 1 and 10.
Data Set Preview :-
paged_table(WorldHappinessReport2019)
colSums(is.na(WorldHappinessReport2019))
## Overall.rank Country.or.region
## 0 0
## Score GDP.per.capita
## 0 0
## Social.support Healthy.life.expectancy
## 0 0
## Freedom.to.make.life.choices Generosity
## 0 0
## Perceptions.of.corruption
## 0
There are no missing values in the original data set.
WorldHappinessReport2019 <- WorldHappinessReport2019 %>% rename(Country = "Country.or.region")
WHI_2019 <- data.table(WorldHappinessReport2019) ;
WHI_2019[, Area := ifelse(Country %in% c("Bahamas","Barbados", "Belize",
"Canada", "Costa Rica", "Cuba",
"Dominica","Dominican Republic",
"El Salvador","Grenada", "Guatemala",
"Haiti","Honduras","Jamaica",
"Mexico","Nicaragua","Panama",
"Saint Kitts and Nevis","Saint Lucia",
"Trinidad and Tobago",
"United States"),"West",
"Rest of the World")]
We add an extra parameter to classify the countries in the West separately from the rest of the world for analysis purposes.
Data set sample after adding the new Area Classification parameter.
head(paged_table(WHI_2019[,c(10,1:9)]), 10)
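The classification step relies on exact string matching through `%in%`, so a listed name that is misspelled or absent from the data silently adds nothing to the West group. The sketch below shows one way to check the resulting group sizes and to list names that never matched. The data frame `toy`, its scores, and the shortened `west` vector are hypothetical stand-ins, not the report's data.

```r
# Toy stand-in for WHI_2019; country scores here are illustrative only.
toy <- data.frame(
  Country = c("Canada", "Mexico", "Finland", "Japan", "Panama"),
  Score   = c(7.3, 6.6, 7.8, 5.9, 6.3)
)

# Hypothetical short list of "West" countries for this sketch.
west <- c("Canada", "Mexico", "Panama", "Saint Lucia")

# %in% is TRUE only for exact matches, so spelling and whitespace matter.
toy$Area <- ifelse(toy$Country %in% west, "West", "Rest of the World")

# Number of rows that ended up in each group.
area_counts <- table(toy$Area)
print(area_counts)

# Listed names that did not match any row in the data.
unmatched <- setdiff(west, toy$Country)
print(unmatched)
```

Running the same two checks on the real table is a quick way to confirm how many countries actually fall into each group before any test is run.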
Numeric Columns :
num_cols = unlist(lapply(WorldHappinessReport2019, is.numeric)) ; num_cols
## Overall.rank Country
## TRUE FALSE
## Score GDP.per.capita
## TRUE TRUE
## Social.support Healthy.life.expectancy
## TRUE TRUE
## Freedom.to.make.life.choices Generosity
## TRUE TRUE
## Perceptions.of.corruption
## TRUE
num = WorldHappinessReport2019[ , num_cols]
library(Hmisc)
hist.data.frame(num)
Score follows an approximately normal distribution.
Social.support, Healthy.life.expectancy and Freedom.to.make.life.choices have left-skewed distributions, meaning the medians of these three features will typically be greater than their means.
Generosity and Perceptions.of.corruption have slightly right-skewed distributions.
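The skew direction read from the histograms can also be checked numerically. The sketch below compares mean and median and computes a simple moment-based skewness coefficient. The two vectors are made-up examples, `skewness()` is a helper defined here rather than a function from a package used in this report, and the mean-versus-median comparison is a rule of thumb, not a guarantee.

```r
# Made-up example vectors: one with a long left tail, one with a long right tail.
left_tail  <- c(0.2, 0.9, 1.0, 1.1, 1.2, 1.3)
right_tail <- c(0.0, 0.1, 0.2, 0.3, 0.4, 1.1)

# Simple moment-based skewness: negative for left skew, positive for right skew.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Summarise mean, median and skewness for one vector.
describe_skew <- function(x) {
  c(mean = mean(x), median = median(x), skewness = skewness(x))
}

left_summary  <- describe_skew(left_tail)
right_summary <- describe_skew(right_tail)
print(rbind(left_tail = left_summary, right_tail = right_summary))
```

The same helper could be applied column by column to the numeric columns, for example with `sapply(num, skewness)`.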
WHI_2019 %>% group_by(Area) %>% summarise(Min = min(Score,na.rm = TRUE),
Q1 = quantile(Score,probs = .25,na.rm = TRUE),
Median = median(Score,na.rm = TRUE),
Q3 = quantile(Score,probs = .75,na.rm = TRUE),
Max = max(Score,na.rm = TRUE),
Mean = mean(Score, na.rm = TRUE),
SD = sd(Score, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Score))) -> table1
knitr::kable(table1)
| Area | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Rest of the World | 2.853 | 4.51825 | 5.2795 | 6.11975 | 7.769 | 5.345056 | 1.1045713 | 144 | 0 |
| West | 3.597 | 5.88250 | 6.2870 | 6.66925 | 7.278 | 6.151583 | 0.9711304 | 12 | 0 |
A review of the summary statistics indicates that the mean score of the West is higher than that of the rest of the world.
The median for the West is approximately 0.1 points higher than its mean, indicating a left skew in the distribution.
The minimum for the Rest of the World is more than 0.7 points lower than the minimum for the West.
WHI_2019 %>% boxplot(Score ~ Area, data = ., ylab = "Happiness Score", col=c('pink', 'sky blue'))
The box plot shows a difference in the median happiness scores between the West and the rest of the world.
There is one outlier in the West group.
top20<-WHI_2019 %>% filter(Overall.rank<=20)%>% arrange(desc(Score))
top20$label<-paste(top20$Country,top20$Overall.rank ,top20$Score ,sep="\n ")
options(repr.plot.width=12, repr.plot.height=8)
treemap(top20,
index=c("label"),
vSize="Score",
vColor="Overall.rank",
type="value",
title="Top 20 Happiness Countries -2019",
palette=terrain.colors(20),
command.line.output = TRUE,
format.legend = list(scientific = FALSE, big.mark = " "))
corrplot(cor(WHI_2019 %>%
select(Score:Perceptions.of.corruption)),
method="color",
sig.level = 0.01, insig = "blank",
addCoef.col = "black",
tl.srt=45,
type="lower"
)
Score has a high positive correlation with ‘GDP.per.capita’, ‘Social.support’ and ‘Healthy.life.expectancy’, and a very low correlation with ‘Generosity’.
‘Healthy.life.expectancy’ has very high positive correlation of 0.84 with ‘GDP.per.capita’.
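A correlation coefficient on its own does not say whether the association is statistically distinguishable from zero. As a minimal sketch, base R's `cor.test()` reports both the Pearson estimate and a p-value for a single pair of variables. The vectors `gdp` and `score` below are invented for illustration and are not taken from the data set.

```r
# Invented example values; not taken from the happiness data.
gdp   <- c(0.3, 0.6, 0.9, 1.1, 1.3, 1.5)
score <- c(3.9, 4.6, 5.2, 5.9, 6.8, 7.4)

# Pearson correlation coefficient.
r <- cor(gdp, score)

# Test of H0: the true correlation is zero.
ct <- cor.test(gdp, score, method = "pearson")

print(r)
print(ct$p.value)
```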
The Rest of the World group is large (n = 144), but the West group contains only 12 countries, which is below the usual n > 30 rule of thumb, so we plot Q-Q plots for both groups to check for normality.
HS_west <- WHI_2019 %>% filter(Area == "West")
HS_west$Score %>% qqPlot(dist="norm", main = "QQ plot - Happiness Scores - West", col= 'dark blue', col.lines = 'sky blue')
## [1] 12 1
HS_ROW<- WHI_2019 %>% filter(Area == "Rest of the World") ;
HS_ROW$Score %>% qqPlot(dist="norm", main = "QQ plot - Happiness Scores -Rest of the World", col= 'red', col.lines = 'pink')
## [1] 144 1
The data points fall close to the diagonal line in both groups, indicating that the scores are approximately normally distributed.
However, a slight ‘s’ shape is observed in both groups, indicating some departure from normality.
For the Rest of the World group (n = 144) this is of little concern: by the Central Limit Theorem, when the sample size is large, the sampling distribution of the mean is approximately normal regardless of the underlying population distribution. The West group (n = 12) is too small to rely on this argument, so its result should be read with more caution.
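The effect of sample size on the Central Limit Theorem argument can be illustrated with a small simulation. The sketch below draws repeated samples of size 12 and 144 from a deliberately skewed population and compares the skewness of the resulting sample means. The exponential population, the number of repetitions, and the helper names are assumptions made for this illustration and have nothing to do with the happiness data.

```r
set.seed(42)

# Hypothetical skewed population: exponential with mean 1.
# Returns `reps` sample means, each computed from a sample of size n.
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

# Simple moment-based skewness (0 for a symmetric distribution).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

means_small <- sample_means(12)    # group size comparable to the West
means_large <- sample_means(144)   # group size comparable to Rest of the World

skew_small <- skewness(means_small)
skew_large <- skewness(means_large)
print(c(n12 = skew_small, n144 = skew_large))
```

The sample means from the larger samples should be both less spread out and closer to symmetric, which is why the normal approximation is easier to justify for the larger group.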
Levene’s test: a function in R to compare the variances of the two groups.
If the variances of the two groups are not equal, we would expect a statistically significant result from leveneTest().
leveneTest(Score ~ Area , data = WHI_2019)
The p-value for the Levene’s test of equal variance is 0.0826 (> 0.05)
Since the p-value > 0.05, we assume equal variance for both.
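To make the test less of a black box, the sketch below computes a median-centred Levene statistic by hand in base R: take each observation's absolute deviation from its group median, then run a one-way ANOVA on those deviations. This mirrors the default `center = median` behaviour of `car::leveneTest()`. The `score` and `area` vectors are toy values chosen for illustration, not the report's data.

```r
# Toy two-group data; values are illustrative only.
score <- c(5.1, 6.0, 4.2, 5.5, 6.3, 4.8, 6.9, 7.1, 6.5, 3.9)
area  <- factor(c(rep("Rest of the World", 6), rep("West", 4)))

# Absolute deviation of each score from the median of its own group.
abs_dev <- abs(score - ave(score, area, FUN = median))

# One-way ANOVA on the deviations; its F test is the Levene statistic.
levene_fit <- anova(lm(abs_dev ~ area))
levene_f <- levene_fit[["F value"]][1]
levene_p <- levene_fit[["Pr(>F)"]][1]

print(c(F = levene_f, p = levene_p))
```

A large p-value here means the spread of the two groups is not distinguishable at the chosen significance level, which is the condition for using the pooled-variance t-test.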
Null Hypothesis : H0:μ1−μ2=0; The difference in the mean happiness score between the West vs Rest of the world is 0
Alternate Hypothesis : H1:μ1−μ2≠0; The difference in the mean happiness score between the West vs Rest of the world is not 0
Significance Level = 0.05
Decision Rule : If p-value < 0.05, reject the null hypothesis; otherwise, fail to reject it.
t.test(
WHI_2019$Score ~ WHI_2019$Area,
data = WHI_2019,
var.equal = TRUE,
alternative = "two.sided"
)
##
## Two Sample t-test
##
## data: WHI_2019$Score by WHI_2019$Area
## t = -2.4501, df = 154, p-value = 0.0154
## alternative hypothesis: true difference in means between group Rest of the World and group West is not equal to 0
## 95 percent confidence interval:
## -1.4568199 -0.1562356
## sample estimates:
## mean in group Rest of the World mean in group West
## 5.345056 6.151583
Using the p-value method to test the hypothesis, as the p-value = 0.0154 < 0.05, we reject H0.
There is a statistically significant difference between the mean happiness scores of the West and the Rest of the World.
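As a cross-check, the pooled-variance t statistic can be recomputed from the group summary statistics reported in the table earlier (means, standard deviations and group sizes). The sketch below is a hand calculation under the equal-variance assumption; the variable names are chosen for this example.

```r
# Summary statistics copied from the summary table in this report.
n_row  <- 144; mean_row  <- 5.345056; sd_row  <- 1.1045713
n_west <- 12;  mean_west <- 6.151583; sd_west <- 0.9711304

# Degrees of freedom and pooled variance for the equal-variance t-test.
df <- n_row + n_west - 2
pooled_var <- ((n_row - 1) * sd_row^2 + (n_west - 1) * sd_west^2) / df

# Standard error of the difference in means, then the t statistic.
std_err <- sqrt(pooled_var * (1 / n_row + 1 / n_west))
t_stat  <- (mean_row - mean_west) / std_err

# Two-sided p-value and the decision at the 0.05 significance level.
p_value   <- 2 * pt(-abs(t_stat), df)
alpha     <- 0.05
reject_h0 <- p_value < alpha

print(c(t = t_stat, df = df, p = p_value))
print(reject_h0)
```

The result should agree with the `t.test()` output above (t of about -2.45 on 154 degrees of freedom, p of about 0.0154), and since the p-value is below 0.05 the decision rule leads to rejecting H0.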
plot_ly(data = WHI_2019,
x=~Generosity, y=~Score, color=~Generosity, type = "scatter",
text = ~paste("Country:", Country)) %>%
layout(title = "Happiness and Generosity",
xaxis = list(title = "Generosity"),
yaxis = list(title = "Happiness Score"))
The dependent variable ‘Score’ and independent variable ‘Generosity’ are quantitative parameters.
R^2
model <- lm( Score ~ Generosity,
data = WHI_2019)
summary(model)$adj.r.squared
## [1] -0.0007069411
summary(model)
##
## Call:
## lm(formula = Score ~ Generosity, data = WHI_2019)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56930 -0.81851 0.00815 0.78707 2.39012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2433 0.1951 26.872 <2e-16 ***
## Generosity 0.8861 0.9390 0.944 0.347
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.114 on 154 degrees of freedom
## Multiple R-squared: 0.005749, Adjusted R-squared: -0.0007069
## F-statistic: 0.8905 on 1 and 154 DF, p-value: 0.3468
Linear Model : HappinessScore = 5.2433 + (0.8861 x Generosity)
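Two numbers in the model summary can be verified by hand. The sketch below recomputes the adjusted R-squared from the multiple R-squared using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), and evaluates the fitted line at a single input. The generosity value 0.2 is a hypothetical input chosen for illustration, and `predict_score()` is a helper defined only for this example.

```r
# Values taken from the model summary in this report.
r2 <- 0.005749   # multiple R-squared
n  <- 156        # number of countries
p  <- 1          # number of predictors (Generosity)

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj_r2)

# Fitted line from the reported coefficients.
predict_score <- function(generosity) 5.2433 + 0.8861 * generosity

# Predicted score for a hypothetical generosity value of 0.2.
predicted <- predict_score(0.2)
print(predicted)
```

The adjusted value comes out slightly negative, matching the reported -0.0007, which means Generosity alone explains essentially none of the variation in Score. The slope is also not statistically significant (p = 0.347), so the prediction is illustrative rather than useful.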
par.orig <- par(mfrow=c(2,2))
plot(WHI_2019$Generosity, resid(model), main="Predictor vs Residuals", xlab="Generosity")
abline(0,0)
plot(fitted(model), resid(model),main="Fitted vs Residuals", xlab="Fitted Values")
abline(0,0)
qqnorm(resid(model), main="QQ-Plot of Residuals")
qqline(resid(model))
hist(resid(model), main="Histogram of Residuals")
## [1] "Country with highest Generosity of 0.566 is Myanmar"
## [1] "Country with lowest Generosity of 0 is Greece"
After analyzing the 2019 global happiness data published by the United Nations Sustainable Development Solutions Network, I found no statistically significant linear relationship between generosity and the happiness score (p = 0.347).
The two-sample t-test assuming equal variance found a statistically significant difference between the mean happiness scores of the West and the rest of the world, t(154) = −2.45, p = 0.0154, 95% CI for the difference in means (Rest of the World minus West) [−1.457, −0.156].
Score has a high positive correlation with ‘GDP.per.capita’ and ‘Social.support’, suggesting that people in richer countries with stronger social support tend to report being happier.
The results of the investigation suggest that the West has a significantly higher average happiness score than the rest of the world, although the West group contains only 12 countries.